Redcedar Data Analyses Instructions

Please note this analysis and R Markdown document are still in development :)

Approach

The overall approach is to model empirical data collected by community scientists with ancillary climate data to identify important predictors of western redcedar dieback.

Data Wrangling

Import iNat Data - Empirical Tree Points (Response variables)

The steps for wrangling the data are described here.

Format and export for collecting climateNA data

Data were subset to include only GPS information for use in collecting ancillary data.
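This subsetting step can be sketched in R as follows. The data frame name inat and its column names are assumptions, not taken from the project code, and the exact input format should be checked against the ClimateNA documentation (it typically expects ID1, ID2, lat, long, and el columns):

```r
# Sketch: build a ClimateNA-ready input file from the iNat observations.
# `inat` and its column names are assumed here.
gps <- data.frame(
  ID1  = inat$id,          # first identifier column expected by ClimateNA
  ID2  = inat$id,          # second identifier column
  lat  = inat$latitude,
  long = inat$longitude,
  el   = NA                # elevation, if available; otherwise left blank
)
write.csv(gps, "gps2232.csv", row.names = FALSE)
```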

Remove iNaturalist columns and explanatory variables not needed for random forest models

Import Normals Data

Climate data were then extracted for the iNat GPS locations using the ClimateNA tool, following the process below.

ClimateNA version 7.40

  • Climate data extraction process with ClimateNA
    • Convert data into format for climateNA use (see above)
    • In ClimateNA
      • Normal Data
        • Select input file (browse to gps2232 file)
        • Choose ‘More Normal Data’
          • Select ‘Normal_1991_2020.nrm’
        • Choose ‘All variables(265)’
        • Specify output file
  • Grouping explored
    • data averaged over 30 year normals (1991-2020)

Variables

Note that the analysis below uses the iNat data with 1510 observations. Amazing!

  • Response variables included in this analysis
    • Tree canopy symptoms (binary)
  • Explanatory variables included
    • Climate data
      • 30yr normals 1991-2020 (265 variables - annual, seasonal, monthly)

Remove specific climate variables not useful as explanatory variables (e.g. norm_Latitude)

Separate climate variable groupings

Normals data for 265 variables were downloaded for each point:

  • Monthly - 180 variables represented data averaged over individual months for the 30-year period
  • Seasonal - 60 variables represented data averaged over 3-month seasons (4 seasons) for the 30-year period
  • Annual - 20 variables represented data averaged over all years during the 30-year period
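One way to separate these groupings in R, assuming ClimateNA's naming conventions (monthly names end in a 01-12 suffix, seasonal names end in _wt/_sp/_sm/_at, and the remainder are annual) and a data frame called normals:

```r
# Sketch: split the normals columns by ClimateNA's naming conventions.
monthly.idx  <- grepl("(0[1-9]|1[0-2])$", names(normals))
seasonal.idx <- grepl("_(wt|sp|sm|at)$", names(normals))
normals.monthly  <- normals[, monthly.idx]
normals.seasonal <- normals[, seasonal.idx]
normals.annual   <- normals[, !(monthly.idx | seasonal.idx)]
```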

Remove variables that have near-zero standard deviations (entire column is the same value)
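A minimal sketch of this filtering, assuming the climate columns are numeric and that normals holds the full variable set (the object name normals.nearzerovar for the filtered result follows the expressions used below):

```r
# Sketch: drop columns whose standard deviation is (near) zero,
# i.e. columns where every row holds the same value.
sds  <- sapply(normals, function(x) if (is.numeric(x)) sd(x, na.rm = TRUE) else NA)
keep <- is.na(sds) | sds > 1e-8      # keep non-numeric and varying columns
normals.nearzerovar <- normals[, keep]
```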

Full

Dropping columns with near-zero standard deviation removed length(normals) - length(normals.nearzerovar) climate variables from the full set.

Monthly

Dropping columns with near-zero standard deviation removed length(normals.monthly) - length(normals.monthly.nearzerovar) monthly climate variables.

Seasonal

There were length(normals.seasonal) - length(normals.seasonal.nearzerovar) seasonal variables with near-zero standard deviation.

Annual

There were length(normals.annual) - length(normals.annual.nearzerovar) annual variables with near-zero standard deviation.

Remove the other response variable category (binary or five-category) so it is not included among the explanatory variables

Compare model errors
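The models below were fit with the randomForest package; the Call: line in each printed summary shows the exact arguments. A sketch for the full five-category model:

```r
library(randomForest)

# Fit the five-category model on the full set of normals variables.
# `five.cats.full` is the data frame named in the Call: output.
rf.full <- randomForest(
  reclassified.tree.canopy.symptoms ~ .,
  data       = five.cats.full,
  ntree      = 2001,        # odd tree count avoids tied votes
  importance = TRUE,        # track variable importance
  proximity  = TRUE,        # compute the proximity matrix
  na.action  = na.omit
)
print(rf.full)              # OOB error rate and confusion matrix
```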

Five-category response

Full Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.full, ntree = 2001, importance = TRUE, proximity = TRUE,      na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 43.43%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              68     116    24              35           16   0.7374517
## Healthy               53    1027    83              80           32   0.1945098
## Other                 19     156    63              30            9   0.7725632
## Thinning Canopy       32     153    31              85            8   0.7249191
## Tree is Dead          21      52     6              12           18   0.8348624

Monthly Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.monthly, ntree = 2001, importance = TRUE,      proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 43.25%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              73     115    23              34           14   0.7181467
## Healthy               53    1034    79              78           31   0.1890196
## Other                 19     159    56              31           12   0.7978339
## Thinning Canopy       32     157    28              84            8   0.7281553
## Tree is Dead          19      53     7              12           18   0.8348624

Seasonal Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.seasonal, ntree = 2001, importance = TRUE,      proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 43.52%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              69     117    25              33           15   0.7335907
## Healthy               55    1035    81              76           28   0.1882353
## Other                 20     156    60              30           11   0.7833935
## Thinning Canopy       35     156    30              79            9   0.7443366
## Tree is Dead          20      53     6              14           16   0.8532110

Annual Normal Model

## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.annual, ntree = 2001, importance = TRUE,      proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 43.88%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              68     121    22              33           15   0.7374517
## Healthy               56    1025    85              78           31   0.1960784
## Other                 23     160    58              28            8   0.7906137
## Thinning Canopy       31     158    27              82           11   0.7346278
## Tree is Dead          20      52     7              12           18   0.8348624

Binary Normal Model

Full Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 32.35%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       954       321   0.2517647
## Unhealthy     400       554   0.4192872

Monthly Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.monthly,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 32.57%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       941       334   0.2619608
## Unhealthy     392       562   0.4109015

Seasonal Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.seasonal,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 32.79%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       944       331   0.2596078
## Unhealthy     400       554   0.4192872

Annual Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 33.02%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       940       335   0.2627451
## Unhealthy     401       553   0.4203354

Summary of model performance

Response   Explanatory vars   Vars tried at each split   OOB error (%)
5 class    Full               14                         43.43
5 class    Monthly            12                         43.25
5 class    Seasonal            7                         43.52
5 class    Annual              4                         43.88
Binary     Full               14                         32.35
Binary     Monthly            12                         32.57
Binary     Seasonal            7                         32.79
Binary     Annual              4                         33.02

Identify important variables

Binary Response, Annual Explanatory Variables

Binary Response, Seasonal Explanatory Variables

Clearly all of the climate variables are highly correlated.

Let's pick the top-performing metric in our random forest analyses, CMI, and then add any less-correlated variables.

Below we can check the correlation of CMI, MAP, and DD_18.
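Assuming the normals columns are named norm_CMI, norm_MAP, and norm_DD_18 (the exact names are an assumption here), the correlation check could look like:

```r
# Sketch: pairwise Pearson correlations among the three candidate predictors.
cor(binary.annual[, c("norm_CMI", "norm_MAP", "norm_DD_18")],
    use = "pairwise.complete.obs")
```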

Now we can check how the model performs with only these three climate variables.
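A sketch of the reduced model, again assuming the column names norm_CMI, norm_MAP, and norm_DD_18:

```r
# Sketch: refit the binary random forest using only the three chosen
# climate variables plus the response column.
binary.three <- binary.annual[, c("binary.tree.canopy.symptoms",
                                  "norm_CMI", "norm_MAP", "norm_DD_18")]
rf.three <- randomForest(
  binary.tree.canopy.symptoms ~ .,
  data       = binary.three,
  ntree      = 2001,
  importance = TRUE,
  na.action  = na.omit
)
print(rf.three)
```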


It’s hard to give up the seasonality data, but the seasonal variables are all highly correlated (data not shown). Looking at the importance plot above for the seasonal data, the winter variables (norm_CMI_wt, norm_DD_18_wt, and norm_PPT_wt) all had the highest MeanDecreaseAccuracy and MeanDecreaseGini. Therefore, even if we chose to build the model on seasonal data, we would likely want to use the winter values for each variable.